    Introducing a framework to assess newly created questions with Natural Language Processing

    Statistical models such as those derived from Item Response Theory (IRT) enable the assessment of students on a specific subject, which can be useful for several purposes (e.g., learning path customization, drop-out prediction). However, the questions have to be assessed as well and, although IRT can estimate the characteristics of questions that have already been answered by several students, this technique cannot be used on newly generated questions. In this paper, we propose a framework to train and evaluate models for estimating the difficulty and discrimination of newly created Multiple Choice Questions by extracting meaningful features from the text of the question and of the possible choices. We implement one model using this framework and test it on a real-world dataset provided by CloudAcademy, showing that it outperforms previously proposed models, reducing the RMSE by 6.7% for difficulty estimation and by 10.8% for discrimination estimation. We also present the results of an ablation study performed to support our choice of features and to show the effects of different characteristics of the questions' text on difficulty and discrimination. Comment: Accepted at the International Conference on Artificial Intelligence in Education.
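
    A minimal sketch of the kind of pipeline this abstract describes: predicting an IRT difficulty parameter for new questions from text features alone. This is not the authors' implementation; the TF-IDF features, the random-forest regressor, and the toy data are illustrative assumptions (requires scikit-learn and numpy).

    ```python
    # Sketch: estimate IRT difficulty for newly created MCQs from text
    # features, in the spirit of the framework above. Features, model,
    # and data are illustrative stand-ins, not the paper's method.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Hypothetical calibration set: question stem joined with its choices,
    # and IRT difficulty (b) previously estimated from student responses.
    texts = [
        "Which AWS service stores objects? | S3 | EC2 | Lambda | RDS",
        "Which protocol is connectionless? | UDP | TCP | FTP | SMTP",
        "What does IAM manage? | Identities and access | Storage | DNS | Billing",
        "Which database is relational? | RDS | DynamoDB | S3 | SQS",
    ]
    b_params = np.array([-0.9, 0.3, -0.2, 0.7])

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(texts)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, b_params)

    # A brand-new question gets a difficulty estimate with no response data.
    new_item = ["Which service runs containers? | ECS | S3 | Route 53 | SES"]
    print(model.predict(vectorizer.transform(new_item)))

    # Held-out evaluation would use RMSE, the metric reported in the paper.
    preds = model.predict(X)
    print(np.sqrt(mean_squared_error(b_params, preds)))
    ```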

    A meeting report: OECD-GESIS Seminar on Translating and Adapting Instruments in Large-Scale Assessments (2018)

    This report summarizes the main themes and conclusions from the OECD-GESIS Seminar on Translating and Adapting Instruments in Large-Scale Assessments, which took place at the Organisation for Economic Co-operation and Development (OECD) in Paris in June 2018. The five sessions covered the topics of (1) etic (universal) vs. emic (culture-specific) measurement instruments, (2) language- and culture-sensitive development of measurement instruments, (3) international guidelines vs. implementation in countries and by translators, (4) tools and technological developments, and (5) quality control of translations. Key players in the field presented best practices, lessons learned, and innovations, and made suggestions for moving the field forward.

    Using item response theory to explore the psychometric properties of extended matching questions examination in undergraduate medical education

    BACKGROUND: As assessment has been shown to direct learning, it is critical that the examinations developed to test clinical competence in medical undergraduates are valid and reliable. The use of extended matching questions (EMQ) has been advocated to overcome some of the criticisms of using multiple-choice questions to test factual and applied knowledge. METHODS: We analysed the results from the Extended Matching Questions Examination taken by 4th-year undergraduate medical students in the academic year 2001 to 2002. Rasch analysis was used to examine whether the set of questions used in the examination mapped onto a unidimensional scale, the degree of difficulty of questions within and between the various medical and surgical specialties, and the pattern of responses within individual questions to assess the impact of the distractor options. RESULTS: Analysis of a subset of items and of the full examination demonstrated internal construct validity and the absence of bias on the majority of questions. Three main patterns of response selection were identified. CONCLUSION: Modern psychometric methods based upon the work of Rasch provide a useful approach to the calibration and analysis of EMQ undergraduate medical assessments. The approach allows for a formal test of the unidimensionality of the questions and thus the validity of the summed score. Given the metric calibration which follows fit to the model, it also allows for the establishment of item banks to facilitate continuity and equity in exam standards.
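
    For reference, the dichotomous Rasch model underlying this analysis gives the probability of a correct response as a logistic function of the gap between person ability and item difficulty. A minimal illustration in numpy; not the study's analysis code, which would typically use dedicated Rasch software.

    ```python
    # Dichotomous Rasch model: P(correct) depends only on the difference
    # between person ability (theta) and item difficulty (b), both placed
    # on one logit scale. Minimal illustration only.
    import numpy as np

    def rasch_probability(theta: float, b: float) -> float:
        """P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    # When ability equals difficulty, the success probability is 0.5; this
    # shared metric is what makes calibrated item banks and the summed-score
    # validity check described in the abstract possible.
    print(rasch_probability(theta=0.0, b=0.0))   # 0.5
    print(rasch_probability(theta=1.5, b=-0.5))  # ~0.88: easy item, able examinee
    ```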

    A collaborative comparison of Objective Structured Clinical Examination (OSCE) standard setting methods at Australian medical schools

    Background: A key issue underpinning the usefulness of the OSCE assessment in medical education is standard setting, but the majority of standard-setting methods remain challenging for performance assessment because they produce varying passing marks. Several studies have compared standard-setting methods; however, most are limited in experimental scope, or use data on examinee performance at a single OSCE station or from a single medical school. This collaborative study between ten Australian medical schools investigated the effect of standard-setting methods on OSCE cut scores and failure rates. Methods: This research used 5,256 examinee scores from seven shared OSCE stations to calculate cut scores and failure rates using two different compromise standard-setting methods, namely the Borderline Regression and Cohen's methods. Results: The results indicate that Cohen's method yields outcomes similar to the Borderline Regression method, particularly for large examinee cohorts. However, with lower examinee numbers on a station, the Borderline Regression method resulted in higher cut scores and larger differences in failure rates. Conclusion: Cohen's method yields outcomes similar to the Borderline Regression method, and its application for benchmarking purposes and in resource-limited settings is justifiable, particularly with large examinee numbers.
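
    A sketch of how the two compromise methods derive a cut score, under common formulations: Borderline Regression regresses station scores on examiners' global ratings and reads off the predicted score at the borderline category, while Cohen's method takes a fixed fraction of a high-percentile score. The rating scale, the 60% fraction, the 95th percentile, and the simulated data are assumptions for demonstration, not the study's exact procedure.

    ```python
    # Illustrative cut-score computation for the two compromise methods
    # compared in the study. Parameter choices and data are assumptions.
    import numpy as np

    def borderline_regression_cut(scores, global_ratings, borderline=2):
        """Regress station scores on examiners' global ratings and take
        the predicted score at the 'borderline' rating as the cut."""
        slope, intercept = np.polyfit(global_ratings, scores, deg=1)
        return intercept + slope * borderline

    def cohen_cut(scores, fraction=0.60, percentile=95):
        """Cohen's method: a fixed fraction of the score achieved at a
        high percentile of the cohort (commonly 60% of the 95th)."""
        return fraction * np.percentile(scores, percentile)

    rng = np.random.default_rng(0)
    station_scores = rng.normal(68, 11, size=750).clip(0, 100)
    # Global ratings 1 (clear fail) .. 5 (excellent), loosely tied to score.
    ratings = np.clip(np.round((station_scores - 35) / 14), 1, 5)

    print(borderline_regression_cut(station_scores, ratings))
    print(cohen_cut(station_scores))
    ```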

    Testing and Assessment in an International Context: Cross- and Multi-cultural Issues

    Globalisation, increasing migration flows, and concurrent worldwide competitiveness demand a rethinking of testing and assessment procedures and practices in an international and multicultural context. This chapter reviews the methodological and practical implications for psychological assessment in the field of career guidance. The methodological implications are numerous, and several aspects have to be considered, such as cross-cultural equivalence, or construct, method, and item bias. Moreover, the construct of culture is itself difficult to define and to measure. In order to provide non-discriminatory assessment, counsellors should develop their clinical cross-cultural competencies, develop more specific intervention strategies, and respect cultural differences. Several suggestions are given concerning the translation and adaptation of psychological instruments, the development of culture-specific measures, and the use of these instruments. Further research in this field should use mixed methods and multi-centric designs, and consider emic and etic psychological variables. A multidisciplinary approach might also help identify culture-specific and ecologically meaningful constructs. Non-discriminatory assessment implies considering the influence and interaction of personal characteristics and environmental factors.

    Design and Key Features of the PIAAC Survey of Adults

    This chapter gives an overview of the most important features of the Programme for the International Assessment of Adult Competencies (PIAAC) survey as they pertain to two main goals. First, only a well-designed survey will lead to accurate and comparable test scores across different countries and languages, both within and across assessment cycles. Second, only an understanding of its complex survey design will lead to proper use of the PIAAC data in secondary analyses and meaningful interpretation of results by psychometricians, data analysts, scientists, and policymakers. The chapter begins with a brief introduction to the PIAAC survey, followed by an overview of the background questionnaire and the cognitive measures. The cognitive measures are then compared to what was assessed in previous international adult surveys. Key features of the assessment design are discussed, followed by a section describing what could be done to improve future PIAAC cycles.